Distinguishing Humans from Bots in Web Search Logs
Abstract
Cleaning workload data and separating it into classes is a prerequisite for workload characterization. In particular, the workload on web search engines is derived from the activities of both human users and automated bots. It is important to distinguish between these two classes in order to characterize human web search behavior reliably and to study the effects of bot activity. However, available workload data is not accompanied by labels that could serve as a basis for learning and generalization. To cope with the lack of labeled data, we suggest two mechanisms. The first is to employ two thresholds for each criterion, which makes it possible to identify users who are most probably human or most probably bots, according to need, while setting ambivalent cases aside. The second is the notion of “strong” criteria, which identify levels of activity that are highly unlikely or even impossible for humans to achieve. We then use an iterative process of refining the thresholds to combine the results of multiple metrics in a mutually consistent manner. Results using the AOL log identify over 92% of the users as human, and only a small fraction (0.6%) as probable bots. The humans tend to display relatively consistent behavior, whereas bots may exhibit markedly different behaviors. In particular, it is not uncommon for a bot to be very different from typical human behavior according to one criterion, while being indistinguishable from a human according to another.
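The two-threshold idea and the role of “strong” criteria can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the metric names (`queries_per_day`, `max_queries_per_minute`), the threshold values, and the rule used to combine criteria. The iterative refinement of thresholds described in the abstract is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Criterion:
    """One activity metric with two thresholds (all values hypothetical)."""
    name: str
    human_max: float      # at or below this value: probably human
    bot_min: float        # at or above this value: probably bot
    strong: bool = False  # a bot-level value is treated as decisive on its own


def classify_by_criterion(value: float, c: Criterion) -> str:
    """Label a single metric value; the gap between thresholds stays unclassified."""
    if value <= c.human_max:
        return "human"
    if value >= c.bot_min:
        return "bot"
    return "unclassified"


def classify_user(metrics: Dict[str, float], criteria: List[Criterion]) -> str:
    """Combine per-criterion labels with an assumed, deliberately simple rule."""
    labels = [classify_by_criterion(metrics[c.name], c) for c in criteria]
    # Assumed rule 1: a strong criterion at bot level settles the question.
    if any(c.strong and lab == "bot" for c, lab in zip(criteria, labels)):
        return "bot"
    # Assumed rule 2: a user is human only if every criterion agrees.
    if all(lab == "human" for lab in labels):
        return "human"
    # Anything else is left as an ambivalent case.
    return "unclassified"


# Hypothetical thresholds, chosen only to make the example concrete.
CRITERIA = [
    Criterion("queries_per_day", human_max=50, bot_min=1000, strong=True),
    Criterion("max_queries_per_minute", human_max=5, bot_min=20),
]
```

For example, `classify_user({"queries_per_day": 5000, "max_queries_per_minute": 2}, CRITERIA)` labels the user a bot even though the second metric looks human, which mirrors the abstract's observation that a bot can be indistinguishable from a human on one criterion while being far from human behavior on another.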
Similar resources
Image flip CAPTCHA
The massive and automated access to Web resources through robots has made it essential for Web service providers to make some conclusion about whether the "user" is a human or a robot. A Human Interaction Proof (HIP) like Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) offers a way to make such a distinction. CAPTCHA is a reverse Turing test used by Web serv...
Bot or Not? A Case Study on Bot Recognition from Web Session Logs
This work reports on a study of web usage logs to verify whether it is possible to achieve good recognition rates in the task of distinguishing between human users and automated bots using computational intelligence techniques. Two problem statements are given, offline (for completed sessions) and on-line (for sequences of individual HTTP requests). The former is solved with several standard co...
Distinguishing Humans from Robots in Web Search Logs
The workload on web search engines is actually multiclass, being derived from the activities of both human users and automated robots. It is important to distinguish between these two classes in order to reliably characterize human web search behavior, and to study the effect of robot activity. We suggest an approach based on a multidimensional characterization of search sessions, and take firs...
Bots are Users, Too! Rethinking the Roles of Software Agents in HCI
Increasingly sophisticated autonomous software agents called ’bots’ roam throughout the Internet, performing a wide variety of tasks, some for good and some for evil. Yet while autonomous, these bots are not artificial intelligences, instead programmed to perform mundane, routine tasks that would otherwise be impossible by humans. Useful bots crawl the web for search engines, enforce order in I...
Do Bots impact Twitter activity?
The WWW has seen massive growth in population of automated programs (bots) for a variety of exploits on online social networks (OSNs). In this paper we extend on our previous work to study the effects of bots on Twitter. By setting up a bot account on Twitter and conducting analysis on a click logs dataset from our web server, we show that despite bots being in smaller numbers, they exercise a ...